IT326 Project
_________________________________________________________________________________________________________
The Goal
Our primary objective of this analysis is to classify whatever a
student will go to college or not using the classification methods and
to identify the main factors and reasons why students are less likely to
pursue higher education indicated by “will_go_to_college” being ‘False’.
By leveraging the provided dataset with attributes such as school type,
school accreditation, gender, interest in college, residence, we aim to
discover the most influential variables and their relationships with the
decision not to attend college.
Sample of our data
head(Dataset)
Here are a sample of 6 row from our dataset.
Missing values
sum(is.na(Dataset))
[1] 0
There are no missing values in our dataset.
Statistical graphs
Graph 1:
df =data.frame(Dataset)
ggplot(data=df, aes(x = interest, fill = will_go_to_college)) +
geom_bar() +
scale_x_discrete(limits = c('Not Interested', 'Less Interested', 'Uncertain', 'Interested', 'Very Interested')) +
labs(title = 'College interest vs College attendance ') +
scale_fill_manual(values = c("True" = "antiquewhite2", "False" = "antiquewhite3")) +
theme_minimal()

NA
NA
According to the graph, whether students are interested in going to
college or not does not affect whether they actually end up attending
Collage . There is a group of individuals who were interested in
attending but did not receive acceptance, while others who were not
interested were accepted.
Graph 2:
filtered_True =filter(Dataset, will_go_to_college == 'True')
filtered_False =subset(Dataset, will_go_to_college =='False')
ggplot() +
geom_density(data = filtered_True, aes(x = average_grades, fill = "Going to College"), alpha = 0.5) +
geom_density(data = filtered_False, aes(x = average_grades, fill = "NOT Going to College"), alpha = 0.5) +
labs(x = "Average Grades", y = "Density") +
ggtitle("Comparison of Average Grades for Students Going to College and NOT Going to College") +
scale_fill_manual(values = c("Going to College" = "antiquewhite4", "NOT Going to College" = "antiquewhite1"))

The graph shows that the average grades for students who had accepted
to go to college were higher than those who did not enter college , and
this indicates the existence of a correlation between those who going to
college and the average grades
Graph 3:
Dataset_percentage <- Dataset %>%
group_by(type_school) %>%
summarise(percentage = mean(will_go_to_college == "True") * 100)
# Create a percentage chart
ggplot(Dataset_percentage, aes(x = type_school, y = percentage, fill = type_school)) +
geom_bar(stat = "identity") +
labs(title = "Percentage of Students Going to College by Type of School",
x = "Type of School",
y = "Percentage") +
scale_fill_manual(values = c("Academic" = "antiquewhite3", "Vocational" = "antiquewhite2")) +
theme_minimal()

NA
NA
This graph shows the impact of the type of high school attended by
students on their college attendance . Based on the bar chart:
among students from Academic high schools, 313 are going to
college, and 296 are not.
among students from Vocational high schools, 187 are going to
college, and 204 are not
These information tell us that a higher proportion of students from
academic high schools are going to college compared to those from
vocational schools which suggest that the type of school attended.
Graph 4:
ggplot(Dataset, aes(x = average_grades)) +
geom_histogram(binwidth = 5, fill = "antiquewhite2", color = "antiquewhite4") +
labs(title = "Distribution of students' grades",
x = "Students' average grades",
y = "Frequency") +
theme_minimal()

This histogram show us that the majority of the students in the
dataset are performing well since it seems like their grades are
spanning between 75 and 98. This analysis will help us determine whether
the academic performance level of students is a contributing factor to
their college attendance or not.
Statistical Measures
- The student’s academic performance analysis:
summary(Dataset$average_grades)
Min. 1st Qu. Median Mean 3rd Qu. Max.
75.00 83.74 85.58 86.10 88.26 98.00
The student grades in our dataset range from 75.00 to 98.00, with a
median of 85.58 and an average of 86.10. This suggests that most
students are doing well as none of them have average grades below 50.
However, it’s interesting to note that some students have much higher or
lower grades than the average, mainly due to the wide range of
grades..
- The socioeconomic status of students’ families analysis:
summary(Dataset$parent_salary)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1000000 4360000 5440000 5381570 6382500 10000000
In the dataset, we’ve got students parents with salaries ranging from
1,000,000 to 10,000,000 IDR/Rupiah. The median salary is 5,440,000
IDR/Rupiah, and the average is 5,381,570 IDR/Rupiah. This data tells us
that many parents in our dataset earn less than the average salary in
Indonesia, which is 146,000,000 IDR. This suggests that quite a few
students in our dataset come from families with limited finances. And
this financial situation could certainly impact their ability to get
through college.
summary(Dataset$house_area)
Min. 1st Qu. Median Mean 3rd Qu. Max.
20.00 64.60 75.50 74.52 84.83 120.00
Additionally we can utilize the house area attribute to gain a deeper
understanding of the socioeconomic status of students’ families, where
students with houses significantly larger than the mean might indicate a
higher socioeconomic status, while those with houses considerably
smaller than the mean might reflect a comparatively lower socioeconomic
status. Based on the shown output, the house areas range from
[20.00-120.00 ㎡]. The median house area is 75.50 ㎡ indicates that
families with house areas around this value likely have moderate
socioeconomic status with houses that neither very small nor very
large.
- Understanding Parent Age Range and Variation in its Values
summary(Dataset$parent_age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
40.00 50.00 52.00 52.21 54.00 65.00
SD=sd(Dataset$parent_age)
MeanAge=mean(Dataset$parent_age)
cat("coefficient of variation:",SD/MeanAge*100,"%")
coefficient of variation: 6.704771 %
This summary provides the range for age attribute [40,65] which
indicates that all parent in middle age during this age parent have more
concern about their children , the coefficient of variation= 6.7% which
indicates lower variation ,and the value of attribute parent_agerare are
relatively close to the mean overall 25% of them have an age below or
equal to 50 , 75% have an age below or equal to 54 and the median value
is 52
Outliers analysis
###parent age outliers
quartiles <- quantile(Dataset$parent_age, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(Dataset$parent_age)
Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR
data_no_outlier <- subset(Dataset, Dataset$parent_age > Lower & Dataset$parent_age < Upper)
dim(data_no_outlier)
[1] 957 11
###parent salary outliers
quartiles <- quantile(Dataset$parent_salary, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(Dataset$parent_salary)
Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR
data_no_outlier <- subset(data_no_outlier, data_no_outlier$parent_salary> Lower & data_no_outlier$parent_salary < Upper)
dim(data_no_outlier)
[1] 955 11
###averge grades outliers
quartiles <- quantile(Dataset$average_grades, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(Dataset$average_grades)
Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR
data_no_outlier <- subset(data_no_outlier, data_no_outlier$average_grades> Lower & data_no_outlier$average_grades < Upper)
dim(data_no_outlier)
[1] 944 11
###house area outliers
quartiles <- quantile(Dataset$house_area, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(Dataset$house_area)
Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR
data_no_outlier <- subset(data_no_outlier, data_no_outlier$house_area> Lower & data_no_outlier$house_area < Upper)
Founded_Outliers=data.frame(anti_join(Dataset,data_no_outlier))
Joining with `by = join_by(type_school, school_accreditation, gender, interest, residence, parent_age, parent_salary, house_area, average_grades, parent_was_in_college, will_go_to_college)`
print(Founded_Outliers)
After conducting data analysis and identifying outliers, our
inspection reveals that the detected outliers represent inherent
variation within the population. Regarding Parent_age, outliers are
observed for values below 44 and above 65. However, it should be noted
the age from 40 to 65 fall within the expected mean of our dataset
meaning that it doesn’t indicate that they are outliers . For
parent_salary, we found two outliers: one below 1,326,250 ind ≈ 85 USD
and another above 9,416,250 ind ≈ 606 USD. The minimum and maximum
values were determined to be 1,000,000 ind ≈ 64 USD and 10,000,000 ind ≈
644 USD, respectively. In the case of grades, twelve outliers were
identified, ranging from below 76 to above 97. Nevertheless, since the
data falls within the acceptable range of 0 to 100, these outliers
should be retained as they are still considered normal and within the
usual grade range. Finally, for house_area, we found eleven outliers
below 34.4m and above 115m, with the minimum being 20m and the maximum
being 120m. However, these values are still considered typical for the
population.
Normalization
normalize <- function(x) {return((x-min(x))/ (max(x)-min(x)))}
datasetWithoutNormalization<-Dataset
Dataset$parent_salary<-normalize(datasetWithoutNormalization$parent_salary)
Dataset$house_area<-normalize(datasetWithoutNormalization$house_area)
print(Dataset)
We applied normalization to the ‘parent_salary’ and ‘house_area’
attributes, scaling their values to a range between 0 and 1. This
normalization process greatly facilitates data handling and analysis,
ensuring that these attributes are on a consistent scale. Which will
improve the reliability of our data analysis and enable better
conclusions to be drawn from the dataset. Normalization is a crucial
step in preparing the data for modeling, as it prevents attributes with
larger numerical ranges from dominating the analysis and ensures fair
treatment for all features.
Discretization
Dataset$average_grades [Dataset$average_grades >= 95] <- '+A'
Dataset$average_grades [95 >Dataset$average_grades & Dataset$average_grades >= 90] <- 'A'
Dataset$average_grades [90 >Dataset$average_grades & Dataset$average_grades >= 85] <- '+B'
Dataset$average_grades [85 >Dataset$average_grades & Dataset$average_grades >= 80] <- 'B'
Dataset$average_grades [80 >Dataset$average_grades & Dataset$average_grades >= 75] <- '+C'
Dataset$average_grades [75 >Dataset$average_grades & Dataset$average_grades >= 70] <- 'C'
Dataset$average_grades [70 >Dataset$average_grades & Dataset$average_grades >= 65] <- '+D'
Dataset$average_grades [65 >Dataset$average_grades & Dataset$average_grades >= 60] <- 'D'
Dataset$average_grades [60 >Dataset$average_grades & Dataset$average_grades >= 0] <- 'F'
Dataset$average_grades <- as.character(Dataset$average_grades )
print(Dataset)
We transformed the parent_age attribute into intervals by dividing
the values to be fall on one of two possible interval labels with equal
width which is(40,50],(50,60] by discretization the values well be
simpler to classify or perform other methods that can help us later in
our model.
and to better utilize and interpret the grades attributes for each
student, we have converted the numeric grades into letter grades (A+, A,
B+, B, C+, C, D+, D, F). This transformation was undertaken to focus on
the general letter grade representation rather than the precise
numerical values.
Encoding
Dataset$parent_was_in_college[Dataset$parent_was_in_college=="TRUE"]<-1
Dataset$parent_was_in_college[Dataset$parent_was_in_college=="True"]<-1
Dataset$parent_was_in_college[Dataset$parent_was_in_college=="FALSE"]<-0
Dataset$parent_was_in_college[Dataset$parent_was_in_college=="False"]<-0
Dataset$will_go_to_college[Dataset$will_go_to_college=="TRUE"]<-0
Dataset$will_go_to_college[Dataset$will_go_to_college=="True"]<-0
Dataset$will_go_to_college[Dataset$will_go_to_college=="FALSE"]<-1
Dataset$will_go_to_college[Dataset$will_go_to_college=="False"]<-1
Dataset$gender[Dataset$gender=="Female"]<-1
Dataset$gender[Dataset$gender=="Male"]<-0
Dataset$school_accreditation[Dataset$school_accreditation=="A"]<-1
Dataset$school_accreditation[Dataset$school_accreditation=="B"]<-0
Dataset$interest[Dataset$interest=="Very Interested"]<-4
Dataset$interest[Dataset$interest=="Interested"]<-3
Dataset$interest[Dataset$interest=="Less Interested"]<-2
Dataset$interest[Dataset$interest=="Not Interested"]<-1
Dataset$interest[Dataset$interest=="Uncertain"]<-0
Dataset$type_school[Dataset$type_school=="Academic"]<-1
Dataset$type_school[Dataset$type_school=="Vocational"]<-0
Dataset$residence[Dataset$residence=="Urban"]<-1
Dataset$residence[Dataset$residence=="Rural"]<-0
print(Dataset)
Since encoding is an important step in data preprocessing that
enables the use of categorical data in various data analysis and machine
learning tasks, we encoded attributes like the ‘parent was in college’
attribute from (True, False) to (1, 0), and ‘will go to college’ from
(True, False) to (0, 1). This encoding is carried out as we aim to
predict the influencing factors. Additionally, we encoded the ‘gender’
attribute from (Female, Male) to (1, 0), ‘school accreditation’ from (A,
B) to (1, 0), ‘type_school’ from (Academic, Vocational) to (1, 0),
‘residence’ from (Urban, Rural) to (1, 0), and ‘interest’ from (Very
interested ,Interested , Less Interested , Not Interested ,Uncertain )
to (4,3,2, 1, 0) respectively. Encoding serves to simplify the data,
reduce complexity, and enhance its suitability for modeling
purposes.
Correlation analysis Chi square test for nominal attribute:
#1
C=chisq.test(Dataset$type_school , Dataset$will_go_to_college)
print(C)
Pearson's Chi-squared test with Yates' continuity correction
data: Dataset$type_school and Dataset$will_go_to_college
X-squared = 1.0751, df = 1, p-value = 0.2998
#2
C=chisq.test(Dataset$school_accreditation , Dataset$will_go_to_college)
print(C)
Pearson's Chi-squared test with Yates' continuity correction
data: Dataset$school_accreditation and Dataset$will_go_to_college
X-squared = 0.78513, df = 1, p-value = 0.3756
#3
C=chisq.test(Dataset$gender , Dataset$will_go_to_college)
print(C)
Pearson's Chi-squared test with Yates' continuity correction
data: Dataset$gender and Dataset$will_go_to_college
X-squared = 1.0249, df = 1, p-value = 0.3114
#4
C=chisq.test(Dataset$interest , Dataset$will_go_to_college)
print(C)
Pearson's Chi-squared test
data: Dataset$interest and Dataset$will_go_to_college
X-squared = 73.337, df = 4, p-value = 4.477e-15
#5
C=chisq.test(Dataset$residence , Dataset$will_go_to_college)
print(C)
Pearson's Chi-squared test with Yates' continuity correction
data: Dataset$residence and Dataset$will_go_to_college
X-squared = 0.016098, df = 1, p-value = 0.899
#6
C=chisq.test(Dataset$average_grades , Dataset$will_go_to_college)
print(C)
Pearson's Chi-squared test
data: Dataset$average_grades and Dataset$will_go_to_college
X-squared = 261.89, df = 4, p-value < 2.2e-16
#7
C=chisq.test(Dataset$parent_was_in_college , Dataset$will_go_to_college)
print(C)
Pearson's Chi-squared test with Yates' continuity correction
data: Dataset$parent_was_in_college and Dataset$will_go_to_college
X-squared = 2.1194, df = 1, p-value = 0.1454
All the attributes have X-square greater than the p-value which
indicate a some association with the class label; therefore we reject
the null hypothesis
we noticed for ‘interest’ and ‘average grade’ the analysis shows that
X-square is much larger than p-value indicate the significant
association of the two attributes with the decision of the student to go
to the collage or not
Correlation coefficient analysis for numeric attribute:
biserial.cor(Dataset$parent_salary,Dataset$will_go_to_college, c("all.obs", "complete.obs"), level = 1)
[1] 0.4756928
biserial.cor(Dataset$house_area,Dataset$will_go_to_college, c("all.obs", "complete.obs"), level = 1)
[1] 0.4672669
biserial.cor(Dataset$parent_age,Dataset$will_go_to_college,c("all.obs", "complete.obs"), level = 1)
[1] 0.04287336
the analysis shows moderate correlation coefficient for parent salary
and house area with the class label which indicate that they are
relevant factors meaning that the higher the parent salary and the
larger house area the higher probability for a student to enroll in a
collage
where is the on other hand, the correlation coefficient for the
parent age is very small which indicate that the parent age has little
impact to the probability for student to enroll in a collage
Feature selection:
ultimately based on the analysis of the correlation that we conducted
on the relationship of the dataset attributes with the class label, and
the understanding of the data and the context of each attribute and
potential relevance to the class label we decided to not delete any of
the attribute
Classification:
factor the data
data <- Preprocessed_dataset
data$will_go_to_college <- factor(data$will_go_to_college, levels = c("1", "0"), labels = c("1", "0"))
data$residence <- factor(data$residence, levels = c("1", "0"), labels = c("1", "0"))
data$gender <- factor(data$gender, levels = c("1", "0"), labels = c("1", "0"))
data$parent_was_in_college <- factor(data$parent_was_in_college, levels = c("1", "0"), labels = c("1", "0"))
data$interest <- factor(data$interest, levels = c("4","3","2","1", "0"), labels = c("4","3","2","1", "0"))
data$type_school <- factor(data$type_school, levels = c("1", "0"), labels = c("1", "0"))
data$school_accreditation <- factor(data$school_accreditation, levels = c("1", "0"), labels = c("1", "0"))
data$average_grades <- factor(data$average_grades, levels = c("+A", "A","+B","B","+C","C","+D","D","F"), labels = c("+A", "A","+B","B","+C","C","+D","D","F"))
str(data)
'data.frame': 1000 obs. of 11 variables:
$ type_school : Factor w/ 2 levels "1","0": 1 1 1 2 1 2 1 1 1 1 ...
$ school_accreditation : Factor w/ 2 levels "1","0": 1 1 2 2 1 2 1 2 2 2 ...
$ gender : Factor w/ 2 levels "1","0": 2 2 1 2 1 1 2 2 1 1 ...
$ interest : Factor w/ 5 levels "4","3","2","1",..: 3 3 1 1 1 3 1 1 5 1 ...
$ residence : Factor w/ 2 levels "1","0": 1 1 1 2 1 2 2 2 2 2 ...
$ parent_age : int 56 57 50 49 57 48 52 53 52 47 ...
$ parent_salary : num 0.661 0.379 0.611 0.622 0.472 ...
$ house_area : num 0.63 0.568 0.606 0.582 0.551 0.453 0.655 0.633 0.603 0.48 ...
$ average_grades : Factor w/ 9 levels "+A","A","+B",..: 4 3 3 4 3 3 2 4 3 3 ...
$ parent_was_in_college: Factor w/ 2 levels "1","0": 2 2 2 1 2 1 1 1 1 1 ...
$ will_go_to_college : Factor w/ 2 levels "1","0": 2 2 2 2 1 1 2 1 2 1 ...
balanced or imbalanced
library(tidyverse)
Error in library(tidyverse) : there is no package called ‘tidyverse’
we want to confirm that the distribution between the two label data
is not too much different. Because imbalanced datasets can lead to
imbalanced accuracy.
Fortunately ,our data is balanced
partition method
We opted for cross-validation as our partition method owing to the
constraints posed by limited data availability. To ensure robustness in
our evaluation, we employed three distinct values for k folds 2, 3, and
4. we chose small k folds because of our small data
c4.5
library(caret)
Loading required package: lattice
Registered S3 method overwritten by 'data.table':
method from
print.data.table
library(rpart)
library(dplyr)
library(rpart.plot)
Error in library(rpart.plot) : there is no package called ‘rpart.plot’
The gain ratio consistently favors unbalanced splits, as demonstrated
by its selection of “Parent salary” as the root for all three trees even
though it’s shown in the tree “average grades” as the root but the split
point’s that all the value in one diraction . In this configuration, one
partition is notably smaller than the others, and the feature exhibits a
higher number of distinct values. Despite the fact that the node
corresponding to “Parent age”, “parent was in collage” is not pure, the
resulting trees exhibit impressive accuracy levels, all surpassing
94%
cart
library(caret)
library(rpart)
library(rpart.plot)
Error in library(rpart.plot) : there is no package called ‘rpart.plot’
averages grade exhibits the smallest Gini index binary split,
signifying a substantial reduction in impurity. Hence, it is chosen as
the splitting attribute. Conversely, attributes such as ‘type_school,’
‘school_accreditation,’ ‘gender,’ ‘parent_age,’ and
‘parent_was_in_college’ yield minimal impurity reduction, leading to
their exclusion from the tree. The dataset’s balanced class labels and
marginal differences in accuracy across folds result in consistent tree
structures, as evidenced by the identical trees in all folds. For
further details, refer to the index. Overall, the
model attains an 86% accuracy, emphasizing its effectiveness.
ID3
library(caret)
library(partykit)
Error in library(partykit) : there is no package called ‘partykit’
Additional insights reveal that attributes such as school
accreditation, parent was in collage contribute to high impurity. In
contrast, Parent salary is chosen as the root due to its high purity.
Given the balanced class labels in our dataset and minimal variations in
accuracy across folds, the result yields consistent tree structures,
with only two distinct trees observed for all folds. For further
details, please refer to the index. The overall
accuracy consistently surpasses 86%, affirming the model’s efficacy
final analysis
The C4.5 model emerged as the top-performing evaluation model,
achieving an impressive accuracy rate of 94% to 97%. It was followed by
the ID3 model, which demonstrated slightly lower accuracy ranging from
86% to 89%. Lastly, the cart model exhibited an accuracy rate of
86%.
the C4.5 gave better result than ID3 and Cart because they both are
biased to multivalued where C4.5 normalized parent salary and house area
which are multivalue attributes
C4.5 and ID3 models, the parent’s salary served as the root feature,
indicating that the financial circumstances of the student are a crucial
factor for contemporary universities.
index
all trees of C4.5
library(caret)
library(rpart)
library(dplyr)
library(rpart.plot)
Error in library(rpart.plot) : there is no package called ‘rpart.plot’
Clustering Analysis:
In this analysis, we apply K-means clustering to the dataset using
different values of K. K-means clustering is an unsupervised learning
algorithm that partitions the data into K clusters based on similarity.
We will explore three different values of K and evaluate the clustering
results using various metrics.
Removing the class label and preparing the dataset for
Clustering
original_data <- Preprocessed_dataset
# Remove any non-numeric attributes
numeric_data <- original_data[, sapply(original_data, is.numeric)]
# Remove the class label 'will_go_to_college'
numeric_data <- numeric_data[, !(names(numeric_data) == 'will_go_to_college')]
# Print the dataset "numeric_data" to make sure it's prepared for clustering
print(numeric_data)
# Scaling the dataset
# numeric_data <- scale(numeric_data)
Now, the ‘numeric_dataset’ dataset contains only numeric attributes
without the class label, which makes it ready for the clustering
process.
K=2
# k-means clustering set a seed for random number generation to make the results reproducible
set.seed(8953)
# run kmeans clustering to find 2 clusters
kmeans.result <- kmeans(numeric_data, 2)
# visualize clustering
library(factoextra)
fviz_cluster(kmeans.result, data = numeric_data)

# print the clustering result
print(kmeans.result)
K-means clustering with 2 clusters of sizes 531, 469
Cluster means:
type_school school_accreditation gender interest residence parent_age parent_salary house_area
1 0.5856874 0.4067797 0.5235405 2.001883 0.3163842 49.67985 0.5219000 0.5351620
2 0.6353945 0.5650320 0.4413646 2.296375 0.7910448 55.07036 0.4471476 0.5564648
parent_was_in_college
1 0.700565
2 0.315565
Clustering vector:
[1] 2 2 1 1 2 1 1 2 1 1 2 1 1 2 1 2 1 1 1 1 1 2 1 2 2 1 2 2 1 1 1 2 2 1 1 2 2 1 2 2 1 2 2 1 1 1 1
[48] 2 2 2 1 2 1 1 1 2 2 1 1 2 1 1 1 1 2 2 1 1 1 2 2 1 1 2 2 2 2 2 1 1 2 1 2 1 2 2 2 1 1 2 2 2 2 1
[95] 1 1 1 2 1 1 1 2 1 1 1 2 1 2 2 2 1 1 1 1 1 1 1 2 2 2 1 1 1 1 2 2 1 1 1 2 2 1 2 1 1 1 2 1 2 2 2
[142] 2 1 1 2 1 1 2 1 1 1 1 2 1 2 1 2 1 2 2 2 1 2 2 1 2 2 2 1 1 2 1 2 1 2 1 1 1 1 1 2 1 2 1 1 2 2 2
[189] 2 2 2 1 1 2 2 2 1 2 2 2 2 2 2 2 1 2 2 1 2 1 1 1 1 2 2 1 2 1 1 1 2 1 2 1 2 1 2 1 2 1 2 2 2 1 1
[236] 2 2 1 1 2 2 1 1 1 2 2 1 1 1 2 2 1 1 2 1 1 1 2 1 2 2 2 1 1 2 1 2 1 2 1 1 1 1 2 1 1 1 2 1 2 1 1
[283] 2 1 1 2 2 2 1 1 2 1 2 1 1 2 1 1 1 2 1 2 1 1 1 1 1 2 1 2 1 2 2 2 2 2 1 2 2 1 2 2 2 1 1 2 1 2 1
[330] 2 2 1 1 2 2 2 1 1 2 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 1 1
[377] 2 1 2 2 1 2 1 2 2 1 2 2 1 2 1 2 1 1 1 2 1 1 1 1 2 2 1 2 1 2 2 1 2 1 1 2 1 2 1 2 2 1 2 2 1 1 2
[424] 2 2 2 1 2 2 2 1 1 1 2 2 1 2 1 2 1 2 1 2 2 2 1 1 1 1 2 1 1 1 2 1 2 1 1 2 1 1 2 2 1 1 1 1 1 2 1
[471] 2 1 2 2 1 1 1 1 1 1 2 1 1 1 2 1 2 2 1 1 1 1 1 2 1 2 1 2 1 2 2 2 2 1 2 2 2 1 2 2 1 1 1 2 2 2 2
[518] 1 1 2 1 2 2 1 1 1 2 1 1 1 1 2 1 1 2 2 1 1 1 1 1 1 1 2 1 1 1 1 2 2 1 2 2 2 2 1 1 1 1 1 1 2 2 1
[565] 2 1 2 2 1 2 2 2 2 1 1 2 2 1 1 2 1 1 2 1 2 2 1 1 1 2 1 2 2 2 1 1 2 2 1 2 2 2 1 2 2 1 2 1 2 1 1
[612] 1 1 1 2 2 1 1 2 2 1 1 2 1 2 2 1 1 2 2 1 1 2 1 2 2 1 1 1 2 1 2 1 2 2 2 2 1 1 2 2 2 2 1 1 1 2 2
[659] 2 1 2 2 1 2 1 1 1 1 2 1 1 1 1 2 2 1 2 2 2 2 1 2 1 2 2 2 1 1 1 2 1 1 1 1 2 1 1 2 1 1 2 1 2 1 1
[706] 1 1 2 1 2 1 1 2 1 1 2 2 2 1 1 1 2 2 1 2 2 1 1 2 1 2 1 2 1 1 1 1 2 1 2 1 1 1 1 2 1 2 2 1 2 1 1
[753] 2 2 1 1 1 1 2 1 2 2 1 2 2 2 2 2 1 2 2 2 1 1 1 1 2 1 2 1 1 2 2 2 2 2 2 1 1 2 1 2 2 2 2 1 1 2 2
[800] 1 1 1 2 1 2 2 1 2 2 1 1 2 1 1 1 1 1 1 1 2 2 2 2 1 2 1 1 2 1 2 1 1 2 2 1 1 1 2 1 1 2 1 2 2 2 1
[847] 1 2 2 1 1 2 1 2 2 2 1 2 1 1 2 2 2 2 1 2 2 1 1 2 2 1 1 2 2 1 2 2 1 2 1 2 2 1 2 2 1 2 2 1 1 2 1
[894] 2 1 2 2 1 1 1 1 1 2 2 2 1 1 1 2 1 2 2 1 2 2 2 1 1 2 1 1 2 2 1 1 1 2 2 2 2 1 1 2 2 1 1 2 2 1 2
[941] 2 2 2 1 1 1 2 2 2 2 2 1 1 1 1 1 2 2 1 2 1 1 1 2 2 1 1 1 1 2 2 2 1 2 2 1 2 1 2 1 1 1 1 1 2 2 2
[988] 1 2 1 2 1 2 2 1 1 1 1 2 1
Within cluster sum of squares by cluster:
[1] 5022.712 3646.347
(between_SS / total_SS = 45.9 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss"
[7] "size" "iter" "ifault"
The Silhouette coefficient
#average for each cluster
avg_sil <- silhouette(kmeans.result$cluster, dist(numeric_data))
#k-means clustering with estimating k and initializations
fviz_silhouette(avg_sil)
NA

The total within-cluster sum of squares
# Calculate total within-cluster sum of squares
total_withinss <- kmeans.result$tot.withinss
cat("Total Within-Cluster Sum of Squares:", sum(total_withinss), "\n")
Total Within-Cluster Sum of Squares: 8669.059
true_labels <- c(1, 1, 2, 1, 2, 2, 3, 3, 4, 4) # Adjust based on your actual true labels
cluster_assignments <- kmeans.result$cluster
BCubed recall and precision
# Calculate BCubed precision
precision <- 0
for (i in unique(true_labels)) {
cluster_indices <- which(true_labels == i)
precision <- precision + sum((table(cluster_assignments[cluster_indices]) * (table(cluster_assignments[cluster_indices]) - 1)) / sum(table(cluster_assignments[cluster_indices])))
}
precision <- precision / sum(table(cluster_assignments))
# Calculate BCubed recall
recall <- 0
for (j in unique(cluster_assignments)) {
cluster_indices <- which(cluster_assignments == j)
recall <- recall + sum((table(true_labels[cluster_indices]) * (table(true_labels[cluster_indices]) - 1)) / sum(table(true_labels[cluster_indices])))
}
recall <- recall / sum(table(true_labels))
cat("BCubed Precision:", precision, "\n")
BCubed Precision: 0.002333333
cat("BCubed Recall:", recall, "\n")
BCubed Recall: 0.1166667
K=4
# k-means clustering set a seed for random number generation to make the results reproducible
set.seed(8953)
# run kmeans clustering to find 4 clusters
kmeans.result <- kmeans(numeric_data, 4)
# visualize clustering
library(factoextra)
fviz_cluster(kmeans.result, data = numeric_data)

# print the clustering result
print(kmeans.result)
K-means clustering with 4 clusters of sizes 167, 66, 293, 474
Cluster means:
type_school school_accreditation gender interest residence parent_age parent_salary house_area
1 0.6467066 0.5688623 0.3832335 2.245509 0.8323353 55.70060 0.4286427 0.5677605
2 0.6818182 0.6515152 0.5151515 2.181818 1.0000000 59.21212 0.3625589 0.5229545
3 0.5255973 0.3549488 0.5631399 1.928328 0.1979522 48.11604 0.5310315 0.5202048
4 0.6371308 0.5042194 0.4683544 2.227848 0.5822785 52.53165 0.4973347 0.5557004
parent_was_in_college
1 0.2455090
2 0.1060606
3 0.8122867
4 0.4936709
Clustering vector:
[1] 1 1 3 3 1 3 4 4 4 3 1 3 3 4 4 4 3 3 3 3 4 1 4 2 4 4 1 4 4 3 3 4 2 4 4 4 1 3 4 4 3 4 4 4 3 3 4
[48] 1 4 4 3 1 4 4 4 1 1 3 3 1 3 3 4 3 1 4 4 4 4 4 4 3 3 1 4 1 1 4 3 4 4 3 2 4 1 1 1 4 4 4 1 4 2 4
[95] 3 3 3 4 4 4 4 1 4 4 4 4 3 4 2 1 4 4 4 4 3 3 3 4 2 1 3 3 4 3 1 4 3 3 3 4 1 3 4 3 3 3 4 3 4 4 4
[142] 2 3 3 4 3 3 1 3 3 3 3 2 4 1 3 1 4 1 4 4 4 4 1 3 4 4 2 3 3 1 3 2 3 4 3 4 4 4 4 4 4 1 3 3 4 1 4
[189] 4 4 4 3 4 1 4 1 3 4 1 4 1 2 4 1 4 4 2 3 4 3 3 3 3 4 1 4 4 4 3 4 2 4 1 3 4 4 4 3 4 4 4 4 4 4 4
[236] 2 1 4 4 1 1 4 3 4 4 4 3 4 3 1 2 3 3 1 3 3 4 4 3 4 2 4 3 3 4 3 1 3 4 3 3 4 3 4 4 3 3 4 3 1 4 4
[283] 1 3 3 4 1 4 4 4 2 4 2 4 3 1 3 4 3 4 4 4 3 3 4 3 3 4 3 1 3 4 4 1 4 4 4 1 1 3 2 2 1 3 3 1 3 4 3
[330] 4 4 4 4 2 4 4 3 4 4 3 4 4 3 4 4 3 4 1 4 4 4 4 4 3 1 3 4 1 4 3 3 3 3 3 4 3 3 3 3 1 1 1 2 4 4 4
[377] 1 3 4 4 4 4 3 1 4 4 1 2 3 1 4 2 3 3 3 2 4 3 4 3 4 1 3 4 3 4 4 4 1 3 4 1 4 2 3 1 4 4 4 4 3 3 1
[424] 1 2 1 3 1 1 4 3 4 3 1 1 3 4 3 4 4 4 3 2 1 4 3 3 3 4 4 4 4 4 4 3 4 4 4 4 4 4 2 2 3 4 3 3 4 4 4
[471] 4 3 4 4 4 4 4 3 4 3 1 3 3 3 2 4 4 1 4 4 3 4 3 4 3 4 3 2 3 4 4 4 4 4 4 1 1 3 1 4 3 3 4 2 1 4 1
[518] 3 3 1 4 2 1 4 3 4 4 3 3 3 3 1 4 3 2 4 3 3 3 4 3 4 4 4 3 4 3 4 1 1 4 2 4 1 4 3 4 3 4 3 3 4 4 3
[565] 4 4 4 2 4 4 1 1 1 3 4 1 4 3 3 4 4 4 4 3 4 1 4 4 4 4 3 1 4 4 4 3 1 4 4 1 1 1 4 4 4 3 1 4 4 3 4
[612] 3 4 3 2 1 3 4 2 4 4 4 1 3 1 2 3 3 2 4 4 3 2 4 4 1 4 3 4 2 3 2 4 4 4 1 4 4 4 4 1 4 2 4 4 3 4 4
[659] 4 3 2 4 3 2 4 4 3 3 1 4 4 3 3 1 1 3 1 2 1 4 4 1 3 4 1 2 4 4 4 4 4 3 3 3 4 4 4 4 3 3 4 4 4 3 4
[706] 4 4 1 3 4 3 4 1 3 3 4 1 2 3 4 3 2 4 3 1 1 3 3 4 3 4 4 4 3 3 3 4 4 4 4 3 3 3 3 1 4 1 4 3 4 3 4
[753] 4 1 4 3 3 3 4 4 1 4 4 4 1 4 2 2 3 4 4 4 3 4 3 3 4 3 2 4 4 4 4 4 4 4 4 4 4 4 4 1 4 1 4 3 3 1 4
[800] 3 3 3 4 3 4 4 4 4 4 4 3 1 3 3 3 3 3 4 3 4 4 4 1 4 2 4 4 1 4 2 3 3 1 1 3 4 4 1 3 3 4 4 1 1 1 4
[847] 4 1 1 3 3 1 3 4 2 4 3 4 3 3 4 4 4 2 3 4 2 4 3 4 1 4 3 4 4 4 4 4 4 4 3 2 4 3 2 1 4 4 4 3 3 1 4
[894] 4 3 1 4 3 4 3 4 4 4 2 4 3 4 4 2 4 4 1 4 4 2 4 4 3 4 3 3 1 1 4 4 4 1 4 1 1 3 4 1 1 3 4 4 1 4 1
[941] 1 1 1 4 4 3 4 4 4 4 2 4 4 4 3 4 1 1 3 1 4 4 3 1 1 3 3 4 4 4 1 1 4 4 4 3 1 4 1 4 4 4 3 3 1 4 2
[988] 3 4 3 4 4 4 4 4 3 4 3 4 3
Within cluster sum of squares by cluster:
[1] 640.8068 291.1379 2380.6791 2338.2145
(between_SS / total_SS = 64.7 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss"
[7] "size" "iter" "ifault"
The Silhouette coefficient
#average for each cluster
avg_sil <- silhouette(kmeans.result$cluster, dist(numeric_data))
#k-means clustering with estimating k and initializations
fviz_silhouette(avg_sil)
NA

The total within-cluster sum of squares
# Calculate total within-cluster sum of squares
total_withinss <- kmeans.result$tot.withinss
cat("Total Within-Cluster Sum of Squares:", sum(total_withinss), "\n")
Total Within-Cluster Sum of Squares: 5650.838
true_labels <- c(1, 1, 2, 1, 2, 2, 3, 3, 4, 4) # Adjust based on your actual true labels
cluster_assignments <- kmeans.result$cluster
BCubed recall and precision
# Calculate BCubed precision
precision <- 0
for (i in unique(true_labels)) {
cluster_indices <- which(true_labels == i)
precision <- precision + sum((table(cluster_assignments[cluster_indices]) * (table(cluster_assignments[cluster_indices]) - 1)) / sum(table(cluster_assignments[cluster_indices])))
}
precision <- precision / sum(table(cluster_assignments))
# Calculate BCubed recall
recall <- 0
for (j in unique(cluster_assignments)) {
cluster_indices <- which(cluster_assignments == j)
recall <- recall + sum((table(true_labels[cluster_indices]) * (table(true_labels[cluster_indices]) - 1)) / sum(table(true_labels[cluster_indices])))
}
recall <- recall / sum(table(true_labels))
cat("BCubed Precision:", precision, "\n")
BCubed Precision: 0.002333333
cat("BCubed Recall:", recall, "\n")
BCubed Recall: 0.1833333
K=6
# k-means clustering set a seed for random number generation to make the results reproducible
set.seed(8953)
# run kmeans clustering to find 6 clusters
kmeans.result <- kmeans(numeric_data, 6)
# visualize clustering
library(factoextra)
fviz_cluster(kmeans.result, data = numeric_data)

# print the clustering result
print(kmeans.result)
K-means clustering with 6 clusters of sizes 104, 91, 213, 233, 239, 120
Cluster means:
type_school school_accreditation gender interest residence parent_age parent_salary house_area
1 0.5288462 0.5288462 0.4134615 0.8365385 0.6442308 55.15385 0.4498611 0.5810192
2 0.7362637 0.6813187 0.4725275 2.3846154 1.0000000 58.60440 0.3728205 0.5303736
3 0.6666667 0.4741784 0.5258216 3.5399061 0.4835681 50.52113 0.5166093 0.5257512
4 0.6952790 0.5536481 0.4420601 3.2703863 0.8583691 54.01717 0.4693371 0.5510987
5 0.5271967 0.3849372 0.4937238 0.5020921 0.2677824 51.26778 0.5096885 0.5787364
6 0.4750000 0.3500000 0.5500000 1.6666667 0.1166667 46.15833 0.5410000 0.4812833
parent_was_in_college
1 0.4326923
2 0.0989011
3 0.5446009
4 0.2746781
5 0.7740586
6 0.8416667
Clustering vector:
[1] 1 2 3 3 2 6 3 4 5 6 1 6 6 4 3 4 3 3 6 6 3 4 3 2 4 5 4 5 5 5 3 4 2 5 3 4 1 3 4 5 5 4 4 3 3 5 5
[48] 4 1 5 6 1 5 3 5 4 1 6 6 4 5 3 5 3 4 4 5 3 5 4 4 5 6 2 1 4 1 4 5 3 4 5 2 5 4 4 1 5 3 4 1 4 2 5
[95] 6 3 6 4 3 5 5 4 5 3 3 5 3 4 2 2 5 5 5 3 3 6 5 4 2 1 3 6 3 6 1 4 6 3 5 4 4 6 5 3 5 5 4 3 4 4 4
[142] 2 6 6 5 3 5 1 6 3 6 6 2 5 1 3 1 5 1 4 1 3 4 4 5 5 1 2 3 3 1 5 2 3 4 3 3 5 3 5 4 3 1 5 6 1 1 4
[189] 4 4 4 6 5 4 1 4 6 5 1 4 1 2 4 4 3 1 2 6 4 3 6 5 6 4 1 3 4 3 5 3 2 3 2 5 5 5 5 6 4 5 1 5 5 5 5
[236] 2 1 5 3 2 1 3 5 5 4 4 6 5 5 1 2 3 5 4 5 6 3 4 6 4 2 4 6 3 4 5 2 5 4 5 6 5 6 4 5 6 5 4 3 1 5 3
[283] 2 3 3 4 1 4 5 3 2 5 2 3 3 1 5 3 6 5 3 4 6 5 5 5 6 5 6 1 3 4 5 4 4 4 3 1 1 6 2 2 4 6 5 4 6 4 6
[330] 4 4 3 5 2 4 1 6 5 5 5 5 5 3 3 5 6 5 4 3 4 3 5 5 6 1 6 3 2 5 6 3 6 5 5 5 6 3 3 6 1 2 2 2 4 3 3
[377] 1 5 4 4 3 4 3 1 4 3 4 2 3 1 5 2 6 3 3 2 3 5 3 6 4 4 5 4 6 4 4 5 4 3 3 4 5 2 3 1 5 3 5 4 3 6 4
[424] 1 2 2 6 1 1 4 6 3 6 2 4 6 4 6 4 3 5 5 2 2 4 6 5 6 5 4 3 3 5 4 3 4 3 3 4 5 5 2 2 3 3 5 3 3 4 3
[471] 4 6 4 4 5 3 5 6 5 6 1 6 3 5 2 3 4 4 3 5 6 5 3 4 3 4 6 2 6 4 4 4 4 5 1 4 4 6 2 4 6 3 5 2 1 4 1
[518] 3 6 1 5 2 1 3 5 5 4 5 6 6 6 4 3 3 2 4 3 3 6 3 3 5 3 4 3 3 5 5 1 1 5 2 4 4 5 6 3 6 5 3 6 4 4 6
[565] 5 3 4 2 3 4 2 4 4 5 3 4 4 5 5 4 5 5 4 6 5 4 3 5 5 4 6 1 1 4 5 6 1 4 3 4 4 1 3 4 1 3 4 5 4 3 3
[612] 5 5 6 2 1 3 5 2 5 3 5 1 6 1 2 3 3 2 4 5 5 2 5 4 4 5 3 5 2 3 2 5 4 5 1 4 5 3 1 1 4 2 3 3 6 4 5
[659] 4 5 2 4 3 2 3 3 6 3 4 5 3 5 5 4 2 3 1 2 1 4 5 4 6 4 1 2 3 3 3 4 5 3 6 3 5 3 5 5 5 3 4 3 5 6 5
[706] 5 3 4 5 4 3 3 2 6 3 4 1 2 5 3 3 2 4 3 1 2 6 6 4 3 1 3 4 3 6 6 3 4 3 5 6 5 6 5 4 3 1 4 6 5 5 5
[753] 5 1 3 5 3 3 4 5 1 1 3 4 4 5 2 2 5 4 4 4 5 5 3 3 5 6 2 5 5 4 5 5 4 1 4 5 3 5 5 1 1 1 5 3 6 4 4
[800] 3 3 5 4 5 4 4 5 4 1 5 6 2 6 3 6 3 6 5 5 4 4 4 4 5 2 3 5 1 3 2 3 5 1 4 6 5 5 1 5 6 4 5 1 1 1 3
[847] 3 1 4 5 3 4 6 4 2 4 6 4 3 3 4 4 4 2 5 4 2 5 3 1 4 3 6 4 4 3 4 1 3 4 5 2 1 3 2 1 5 4 4 6 6 4 5
[894] 4 6 1 4 3 5 5 3 5 4 2 4 6 3 5 2 5 4 1 3 5 2 5 5 5 4 3 3 1 4 5 5 5 4 4 4 1 6 3 1 4 6 5 4 2 3 1
[941] 1 1 4 3 5 3 1 4 4 4 2 3 3 3 5 5 1 1 5 1 5 3 6 4 2 3 5 5 5 5 4 2 5 4 4 6 4 3 2 5 5 3 3 6 2 4 2
[988] 6 4 3 4 3 4 4 5 3 5 3 5 3
Within cluster sum of squares by cluster:
[1] 285.6366 426.7825 722.6048 655.5483 782.3111 828.9997
(between_SS / total_SS = 76.9 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss"
[7] "size" "iter" "ifault"
The Silhouette coefficient
#average for each cluster
avg_sil <- silhouette(kmeans.result$cluster, dist(numeric_data))
#k-means clustering with estimating k and initializations
fviz_silhouette(avg_sil)
NA

The total within-cluster sum of squares
# Calculate total within-cluster sum of squares
total_withinss <- kmeans.result$tot.withinss
cat("Total Within-Cluster Sum of Squares:", sum(total_withinss), "\n")
Total Within-Cluster Sum of Squares: 3701.883
true_labels <- c(1, 1, 2, 1, 2, 2, 3, 3, 4, 4) # Adjust based on your actual true labels
cluster_assignments <- kmeans.result$cluster
BCubed recall and precision
# Calculate BCubed precision
precision <- 0
for (i in unique(true_labels)) {
cluster_indices <- which(true_labels == i)
precision <- precision + sum((table(cluster_assignments[cluster_indices]) * (table(cluster_assignments[cluster_indices]) - 1)) / sum(table(cluster_assignments[cluster_indices])))
}
precision <- precision / sum(table(cluster_assignments))
# Calculate BCubed recall
recall <- 0
for (j in unique(cluster_assignments)) {
cluster_indices <- which(cluster_assignments == j)
recall <- recall + sum((table(true_labels[cluster_indices]) * (table(true_labels[cluster_indices]) - 1)) / sum(table(true_labels[cluster_indices])))
}
recall <- recall / sum(table(true_labels))
cat("BCubed Precision:", precision, "\n")
BCubed Precision: 0
cat("BCubed Recall:", recall, "\n")
BCubed Recall: 0
The optimal number of clusters
To find the optimal number of clusters to use in the k-means
algorithm, we’ll use the fviz_nbclust() function from
the factoextrapackage to create a plot of the number of
clusters vs. the total within sum of squares
# Function to calculate total within-cluster sum of squares (wss)
wss <- function(k) {
kmeans_result <- kmeans(numeric_data, centers = k, nstart = 10) # You can adjust nstart based on your preference
return(sum(kmeans_result$tot.withinss))
}
# Calculate the total within-cluster sum of squares for different values of k
k_values <- 1:10 # You can adjust the range of k values
wss_values <- sapply(k_values, wss)
# Plot the elbow curve
plot(k_values, wss_values, type = "b", pch = 19, frame = FALSE,
xlab = "Number of Clusters (k)", ylab = "Total Within-Cluster Sum of Squares (WSS)",
main = "Elbow Method")
# Adding a line to indicate the "elbow"
abline(v = which(diff(wss_values) == max(diff(wss_values))) + 1, col = "red")

NA
NA
According to the output the best number of clusters is one,
---
title: "R Notebook"
output: html_notebook
---

Important packages:

```{r}
setwd("Dataset")
Dataset <- read.csv("data.csv")
Preprocessed_dataset <-  read.csv("preprocessed_dataset.csv") 
if(!require(ggplot2)){
install.packages("ggplot2")}
library(ggplot2)
if(!require(dplyr)){
install.packages("dplyr")}
library(dplyr) 
if(!require(dplyr)){
install.packages("dplyr")}
library(dplyr)
if(!require(ltm)){
install.packages("ltm")}
library(ltm)
if (!require(cluster, quietly = TRUE)) {
  install.packages("cluster")
}
if (!require(factoextra, quietly = TRUE)) {
  install.packages("factoextra")
}
 
```

# IT326 Project {style="color: gray"}

\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

## The Goal

Our primary objective of this analysis is to classify whatever a student will go to college or not using the classification methods and to identify the main factors and reasons why students are less likely to pursue higher education indicated by "will_go_to_college" being 'False'. By leveraging the provided dataset with attributes such as school type, school accreditation, gender, interest in college, residence, we aim to discover the most influential variables and their relationships with the decision not to attend college.

## The Source

Kaggle.com

## URL :

<https://www.kaggle.com/datasets/saddamazyazy/go-to-college-dataset>

## General information

-   Number of attributes : 11
-   Number of rows (objects) : 1000
-   The class label: The class label in our project is the attribute "will_go_to_college". This attribute is binary, meaning that it can take on two values: True for yes or False for no. The value of this attribute will be the target variable that we are trying to predict during our project.

------------------------------------------------------------------------

+-----------------------+------------------------------------------------------------+---------+---------------------------------------------------------------------------+
| Attribute             | Description                                                | Type    | Possible values                                                           |
+-----------------------+------------------------------------------------------------+---------+---------------------------------------------------------------------------+
| type_school           | The type of school the student attends                     | Binary  | Academic / Vocational                                                     |
+-----------------------+------------------------------------------------------------+---------+---------------------------------------------------------------------------+
| school_accreditation  | The quality if school                                      | Binary  | A / B (A is better than B)                                                |
+-----------------------+------------------------------------------------------------+---------+---------------------------------------------------------------------------+
| gender                | The student's gender                                       | Binary  | Male / Female                                                             |
+-----------------------+------------------------------------------------------------+---------+---------------------------------------------------------------------------+
| interest              | The student's interest in going to college                 | Nominal | Very interested /Interested / Less Interested / Not Interested /Uncertain |
+-----------------------+------------------------------------------------------------+---------+---------------------------------------------------------------------------+
| residence             | The student's residence                                    | Binary  | Urban / Rural                                                             |
+-----------------------+------------------------------------------------------------+---------+---------------------------------------------------------------------------+
| parent_age            | The age of the student's parents                           | Numeric | 40 - 65                                                                   |
+-----------------------+------------------------------------------------------------+---------+---------------------------------------------------------------------------+
| parent_salary         | The monthly salary of the student's parents in IDR/Rupiah. | Numeric | 1000K - 10M                                                               |
|                       |                                                            |         |                                                                           |
|                       | [1Rupiah = 0.00024SAR]                                     |         |                                                                           |
+-----------------------+------------------------------------------------------------+---------+---------------------------------------------------------------------------+
| house_area            | The size of the student's house in meter square            | Numeric | 20 - 120                                                                  |
+-----------------------+------------------------------------------------------------+---------+---------------------------------------------------------------------------+
| average_grades        | The student's average grades in school.                    | Numeric | 75 - 98                                                                   |
+-----------------------+------------------------------------------------------------+---------+---------------------------------------------------------------------------+
| parent_was_in_college | Whether the student's parents attended college.            | Binary  | True - False                                                              |
+-----------------------+------------------------------------------------------------+---------+---------------------------------------------------------------------------+
| will_go_to_college    | Whether the student plans to go to college.                | Binary  | True - False                                                              |
+-----------------------+------------------------------------------------------------+---------+---------------------------------------------------------------------------+

## Sample of our data

```{r}
head(Dataset)
```

Here are a sample of 6 row from our dataset.

## Missing values

```{r}
sum(is.na(Dataset))
```

There are no missing values in our dataset.

## Statistical graphs

Graph 1:

```{r}
df =data.frame(Dataset)
ggplot(data=df, aes(x = interest, fill = will_go_to_college)) +
  geom_bar() +
  scale_x_discrete(limits = c('Not Interested', 'Less Interested', 'Uncertain', 'Interested', 'Very Interested')) +
  labs(title = 'College interest vs College attendance ') +
  scale_fill_manual(values = c("True" = "antiquewhite2", "False" = "antiquewhite3")) +
  theme_minimal()


```

According to the graph, whether students are interested in going to college or not does not affect whether they actually end up attending Collage . There is a group of individuals who were interested in attending but did not receive acceptance, while others who were not interested were accepted.

Graph 2:

```{r}

filtered_True =filter(Dataset, will_go_to_college == 'True')
filtered_False =subset(Dataset, will_go_to_college =='False')

ggplot() +
  geom_density(data = filtered_True, aes(x = average_grades, fill = "Going to College"), alpha = 0.5) +
  geom_density(data = filtered_False, aes(x = average_grades, fill = "NOT Going to College"), alpha = 0.5) +
  labs(x = "Average Grades", y = "Density") +
  ggtitle("Comparison of Average Grades for Students Going to College and NOT Going to College") +
  scale_fill_manual(values = c("Going to College" = "antiquewhite4", "NOT Going to College" = "antiquewhite1"))

```

The graph shows that the average grades for students who had accepted to go to college were higher than those who did not enter college , and this indicates the existence of a correlation between those who going to college and the average grades

Graph 3:

```{r}
Dataset_percentage <- Dataset %>%
  group_by(type_school) %>%
  summarise(percentage = mean(will_go_to_college == "True") * 100)

# Create a percentage chart
ggplot(Dataset_percentage, aes(x = type_school, y = percentage, fill = type_school)) +
  geom_bar(stat = "identity") +
  labs(title = "Percentage of Students Going to College by Type of School",
       x = "Type of School",
       y = "Percentage") +
  scale_fill_manual(values = c("Academic" = "antiquewhite3", "Vocational" = "antiquewhite2")) +
  theme_minimal()

 
```

This graph shows the impact of the type of high school attended by students on their college attendance . Based on the bar chart:

-   among students from Academic high schools, 313 are going to college, and 296 are not.

-   among students from Vocational high schools, 187 are going to college, and 204 are not

These information tell us that a higher proportion of students from academic high schools are going to college compared to those from vocational schools which suggest that the type of school attended.

Graph 4:

```{r}
ggplot(Dataset, aes(x = average_grades)) +
  geom_histogram(binwidth = 5, fill = "antiquewhite2", color = "antiquewhite4") +
  labs(title = "Distribution of students' grades",
       x = "Students' average grades",
       y = "Frequency") +
  theme_minimal()
```

This histogram show us that the majority of the students in the dataset are performing well since it seems like their grades are spanning between 75 and 98. This analysis will help us determine whether the academic performance level of students is a contributing factor to their college attendance or not.

## Statistical Measures

1.  The student's academic performance analysis:

```{r}
summary(Dataset$average_grades)

```

The student grades in our dataset range from 75.00 to 98.00, with a median of 85.58 and an average of 86.10. This suggests that most students are doing well as none of them have average grades below 50. However, it's interesting to note that some students have much higher or lower grades than the average, mainly due to the wide range of grades..

2.  The socioeconomic status of students' families analysis:

```{r}
summary(Dataset$parent_salary)
```

In the dataset, we've got students parents with salaries ranging from 1,000,000 to 10,000,000 IDR/Rupiah. The median salary is 5,440,000 IDR/Rupiah, and the average is 5,381,570 IDR/Rupiah. This data tells us that many parents in our dataset earn less than the average salary in Indonesia, which is 146,000,000 IDR. This suggests that quite a few students in our dataset come from families with limited finances. And this financial situation could certainly impact their ability to get through college.

```{r}
summary(Dataset$house_area)
```

Additionally we can utilize the house area attribute to gain a deeper understanding of the socioeconomic status of students' families, where students with houses significantly larger than the mean might indicate a higher socioeconomic status, while those with houses considerably smaller than the mean might reflect a comparatively lower socioeconomic status. Based on the shown output, the house areas range from [20.00-120.00 ㎡]. The median house area is 75.50 ㎡ indicates that families with house areas around this value likely have moderate socioeconomic status with houses that neither very small nor very large.

3.  Understanding Parent Age Range and Variation in its Values

```{r}
summary(Dataset$parent_age)
SD=sd(Dataset$parent_age)
MeanAge=mean(Dataset$parent_age)
cat("coefficient of variation:",SD/MeanAge*100,"%")

```

This summary provides the range for age attribute [40,65] which indicates that all parent in middle age during this age parent have more concern about their children , the coefficient of variation= 6.7% which indicates lower variation ,and the value of attribute parent_agerare are relatively close to the mean overall 25% of them have an age below or equal to 50 , 75% have an age below or equal to 54 and the median value is 52

## Outliers analysis

```{r}
###parent age outliers
quartiles <- quantile(Dataset$parent_age, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(Dataset$parent_age)

Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR 

data_no_outlier <- subset(Dataset, Dataset$parent_age > Lower & Dataset$parent_age < Upper)
dim(data_no_outlier)

###parent salary outliers
quartiles <- quantile(Dataset$parent_salary, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(Dataset$parent_salary)

Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR 

data_no_outlier <- subset(data_no_outlier, data_no_outlier$parent_salary> Lower & data_no_outlier$parent_salary < Upper)
dim(data_no_outlier)


###averge grades outliers
quartiles <- quantile(Dataset$average_grades, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(Dataset$average_grades)

Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR 

data_no_outlier <- subset(data_no_outlier, data_no_outlier$average_grades> Lower & data_no_outlier$average_grades < Upper)
dim(data_no_outlier)


###house area outliers
quartiles <- quantile(Dataset$house_area, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(Dataset$house_area)

Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR 

data_no_outlier <- subset(data_no_outlier, data_no_outlier$house_area> Lower & data_no_outlier$house_area < Upper)

Founded_Outliers=data.frame(anti_join(Dataset,data_no_outlier))
print(Founded_Outliers)
```

After conducting data analysis and identifying outliers, our inspection reveals that the detected outliers represent inherent variation within the population. Regarding Parent_age, outliers are observed for values below 44 and above 65. However, it should be noted the age from 40 to 65 fall within the expected mean of our dataset meaning that it doesn't indicate that they are outliers . For parent_salary, we found two outliers: one below 1,326,250 ind ≈ 85 USD and another above 9,416,250 ind ≈ 606 USD. The minimum and maximum values were determined to be 1,000,000 ind ≈ 64 USD and 10,000,000 ind ≈ 644 USD, respectively. In the case of grades, twelve outliers were identified, ranging from below 76 to above 97. Nevertheless, since the data falls within the acceptable range of 0 to 100, these outliers should be retained as they are still considered normal and within the usual grade range. Finally, for house_area, we found eleven outliers below 34.4m and above 115m, with the minimum being 20m and the maximum being 120m. However, these values are still considered typical for the population.

## Normalization

```{r}
normalize <- function(x) {return((x-min(x))/ (max(x)-min(x)))}
datasetWithoutNormalization<-Dataset
Dataset$parent_salary<-normalize(datasetWithoutNormalization$parent_salary)
Dataset$house_area<-normalize(datasetWithoutNormalization$house_area)
print(Dataset)
```

We applied normalization to the 'parent_salary' and 'house_area' attributes, scaling their values to a range between 0 and 1. This normalization process greatly facilitates data handling and analysis, ensuring that these attributes are on a consistent scale. Which will improve the reliability of our data analysis and enable better conclusions to be drawn from the dataset. Normalization is a crucial step in preparing the data for modeling, as it prevents attributes with larger numerical ranges from dominating the analysis and ensures fair treatment for all features.

## Discretization

```{r}


Dataset$average_grades [Dataset$average_grades >= 95] <- '+A'
Dataset$average_grades [95 >Dataset$average_grades & Dataset$average_grades >= 90] <- 'A'
Dataset$average_grades [90 >Dataset$average_grades & Dataset$average_grades >= 85] <- '+B'
Dataset$average_grades [85 >Dataset$average_grades & Dataset$average_grades >= 80] <- 'B'
Dataset$average_grades [80 >Dataset$average_grades & Dataset$average_grades >= 75] <- '+C'
Dataset$average_grades [75 >Dataset$average_grades & Dataset$average_grades >= 70] <- 'C'
Dataset$average_grades [70 >Dataset$average_grades & Dataset$average_grades >= 65] <- '+D'
Dataset$average_grades [65 >Dataset$average_grades & Dataset$average_grades >= 60] <- 'D'
Dataset$average_grades [60 >Dataset$average_grades & Dataset$average_grades >= 0] <- 'F'
Dataset$average_grades <- as.character(Dataset$average_grades )
print(Dataset)
```

We transformed the parent_age attribute into intervals by dividing the values to be fall on one of two possible interval labels with equal width which is(40,50],(50,60] by discretization the values well be simpler to classify or perform other methods that can help us later in our model.

and to better utilize and interpret the grades attributes for each student, we have converted the numeric grades into letter grades (A+, A, B+, B, C+, C, D+, D, F). This transformation was undertaken to focus on the general letter grade representation rather than the precise numerical values.

## Encoding

```{r}
Dataset$parent_was_in_college[Dataset$parent_was_in_college=="TRUE"]<-1
Dataset$parent_was_in_college[Dataset$parent_was_in_college=="True"]<-1
Dataset$parent_was_in_college[Dataset$parent_was_in_college=="FALSE"]<-0
Dataset$parent_was_in_college[Dataset$parent_was_in_college=="False"]<-0

Dataset$will_go_to_college[Dataset$will_go_to_college=="TRUE"]<-0
Dataset$will_go_to_college[Dataset$will_go_to_college=="True"]<-0
Dataset$will_go_to_college[Dataset$will_go_to_college=="FALSE"]<-1
Dataset$will_go_to_college[Dataset$will_go_to_college=="False"]<-1

Dataset$gender[Dataset$gender=="Female"]<-1
Dataset$gender[Dataset$gender=="Male"]<-0

Dataset$school_accreditation[Dataset$school_accreditation=="A"]<-1
Dataset$school_accreditation[Dataset$school_accreditation=="B"]<-0

Dataset$interest[Dataset$interest=="Very Interested"]<-4
Dataset$interest[Dataset$interest=="Interested"]<-3
Dataset$interest[Dataset$interest=="Less Interested"]<-2
Dataset$interest[Dataset$interest=="Not Interested"]<-1
Dataset$interest[Dataset$interest=="Uncertain"]<-0


Dataset$type_school[Dataset$type_school=="Academic"]<-1 
Dataset$type_school[Dataset$type_school=="Vocational"]<-0

Dataset$residence[Dataset$residence=="Urban"]<-1
Dataset$residence[Dataset$residence=="Rural"]<-0
print(Dataset)
```

Since encoding is an important step in data preprocessing that enables the use of categorical data in various data analysis and machine learning tasks, we encoded attributes like the 'parent was in college' attribute from (True, False) to (1, 0), and 'will go to college' from (True, False) to (0, 1). This encoding is carried out as we aim to predict the influencing factors. Additionally, we encoded the 'gender' attribute from (Female, Male) to (1, 0), 'school accreditation' from (A, B) to (1, 0), 'type_school' from (Academic, Vocational) to (1, 0), 'residence' from (Urban, Rural) to (1, 0), and 'interest' from (Very interested ,Interested , Less Interested , Not Interested ,Uncertain ) to (4,3,2, 1, 0) respectively. Encoding serves to simplify the data, reduce complexity, and enhance its suitability for modeling purposes.

## Correlation analysis Chi square test for nominal attribute:

```{r}

#1
C=chisq.test(Dataset$type_school , Dataset$will_go_to_college)
print(C)
#2

C=chisq.test(Dataset$school_accreditation , Dataset$will_go_to_college)
print(C)
#3
C=chisq.test(Dataset$gender , Dataset$will_go_to_college)
print(C)
#4
C=chisq.test(Dataset$interest , Dataset$will_go_to_college)
print(C)

#5
C=chisq.test(Dataset$residence , Dataset$will_go_to_college)
print(C)

#6
C=chisq.test(Dataset$average_grades , Dataset$will_go_to_college)
print(C)

#7
C=chisq.test(Dataset$parent_was_in_college , Dataset$will_go_to_college)
print(C)
```

All the attributes have X-square greater than the p-value which indicate a some association with the class label; therefore we reject the null hypothesis

we noticed for 'interest' and 'average grade' the analysis shows that X-square is much larger than p-value indicate the significant association of the two attributes with the decision of the student to go to the collage or not

## Correlation coefficient analysis for numeric attribute:

```{r}

biserial.cor(Dataset$parent_salary,Dataset$will_go_to_college, c("all.obs", "complete.obs"), level = 1)
biserial.cor(Dataset$house_area,Dataset$will_go_to_college, c("all.obs", "complete.obs"), level = 1)
biserial.cor(Dataset$parent_age,Dataset$will_go_to_college,c("all.obs", "complete.obs"), level = 1)
 
 

```

the analysis shows moderate correlation coefficient for parent salary and house area with the class label which indicate that they are relevant factors meaning that the higher the parent salary and the larger house area the higher probability for a student to enroll in a collage

where is the on other hand, the correlation coefficient for the parent age is very small which indicate that the parent age has little impact to the probability for student to enroll in a collage

## Feature selection:

ultimately based on the analysis of the correlation that we conducted on the relationship of the dataset attributes with the class label, and the understanding of the data and the context of each attribute and potential relevance to the class label we decided to not delete any of the attribute

## Classification:

### factor the data

```{r}

data <- Preprocessed_dataset

data$will_go_to_college <- factor(data$will_go_to_college, levels = c("1", "0"), labels = c("1", "0"))
data$residence <- factor(data$residence, levels = c("1", "0"), labels = c("1", "0"))
data$gender <- factor(data$gender, levels = c("1", "0"), labels = c("1", "0"))
data$parent_was_in_college <- factor(data$parent_was_in_college, levels = c("1", "0"), labels = c("1", "0"))
data$interest <- factor(data$interest, levels = c("4","3","2","1", "0"), labels = c("4","3","2","1", "0"))
data$type_school <- factor(data$type_school, levels = c("1", "0"), labels = c("1", "0"))
data$school_accreditation <- factor(data$school_accreditation, levels = c("1", "0"), labels = c("1", "0"))

data$average_grades <- factor(data$average_grades, levels = c("+A", "A","+B","B","+C","C","+D","D","F"), labels = c("+A", "A","+B","B","+C","C","+D","D","F"))
str(data)
```

### balanced or imbalanced

```{r}

library(tidyverse)  
library(caret)
hist(data$will_go_to_college,col="coral") prop.table(table(data$will_go_to_college))
```

we want to confirm that the distribution between the two label data is not too much different. Because imbalanced datasets can lead to imbalanced accuracy.

Fortunately ,our data is balanced

### partition method

We opted for cross-validation as our partition method owing to the constraints posed by limited data availability. To ensure robustness in our evaluation, we employed three distinct values for k folds 2, 3, and 4. we chose small k folds because of our small data

### c4.5

```{r}
library(caret)
library(rpart)
library(dplyr)
library(rpart.plot)


set.seed(123)
fold <- c(2, 3, 4)  # Values of xval (number of folds) to try

for (fold in fold) {
  cat("fold =", fold, "\n")
  trctrl <- trainControl(method = "cv", number = fold, savePredictions = TRUE)
  c45_fit <- train(will_go_to_college ~ ., data = data, method = "J48", trControl = trctrl)
   print(c45_fit$finalModel)
  # Get predicted values
  predictions <- predict(c45_fit, newdata = data)
  
  # Create confusion matrix
  confusion_matrix <- confusionMatrix(predictions, data$will_go_to_college)
  
  # Print the confusion matrix
  print(confusion_matrix)
  
  # Print accuracy for each fold
  pred <- c45_fit$pred
  pred$equal <- ifelse(pred$pred == pred$obs, 1, 0)
  eachfold <- pred %>%
    group_by(Resample) %>%
    summarise_at(vars(equal),
                 list(Accuracy = mean))
  print(eachfold)
    # Plot decision tree
plot(c45_fit$finalModel, main = fold)
  
}
```

The gain ratio consistently favors unbalanced splits, as demonstrated by its selection of "Parent salary" as the root for all three trees even though it's shown in the tree "average grades" as the root but the split point's that all the value in one diraction . In this configuration, one partition is notably smaller than the others, and the feature exhibits a higher number of distinct values. Despite the fact that the node corresponding to "Parent age", "parent was in collage" is not pure, the resulting trees exhibit impressive accuracy levels, all surpassing 94%

### cart

```{r}
library(caret)
library(rpart)
library(rpart.plot)



set.seed(1234) # Random seed
fold <- c(2, 3, 4)  # Values of xval (number of folds) to try

for (fold in fold) {
cat("fold"=fold)
trctrl <- trainControl(method = "cv", number = fold, savePredictions = TRUE)
dt_fit <- train(factor(will_go_to_college) ~ ., data = data, method = "rpart1SE", trControl = trctrl)

 # Plot decision tree
  rpart.plot(dt_fit$finalModel)
  
# Get predicted values
predictions <- predict(dt_fit, newdata = data)

# Create confusion matrix
confusion_matrix <- confusionMatrix(predictions, data$will_go_to_college)

# Print the confusion matrix
print(confusion_matrix)

#print each flod
pred <- dt_fit$pred
pred$equal <- ifelse(pred$pred == pred$obs, 1,0)
eachfold <- pred %>%                                        
  group_by(Resample) %>%                         
  summarise_at(vars(equal),                     
               list(Accuracy = mean))              
print(eachfold  )

  }
```

averages grade exhibits the smallest Gini index binary split, signifying a substantial reduction in impurity. Hence, it is chosen as the splitting attribute. Conversely, attributes such as 'type_school,' 'school_accreditation,' 'gender,' 'parent_age,' and 'parent_was_in_college' yield minimal impurity reduction, leading to their exclusion from the tree. The dataset's balanced class labels and marginal differences in accuracy across folds result in consistent tree structures, as evidenced by the identical trees in all folds. For further details, refer to the [index]. Overall, the model attains an 86% accuracy, emphasizing its effectiveness.

### ID3

```{r}
library(caret)
library(partykit)
library(dplyr)

fold <- c(4, 3, 2)  # Values of xval (number of folds) to try

for (fold in fold) {
  cat("fold =", fold, "\n")
  trctrl <- trainControl(method = "cv", number = fold, savePredictions = TRUE)
  
  id_fit <- train(factor(will_go_to_college) ~ ., data = data, method = "ctree", trControl = trctrl)
   print(id_fit$finalModel)
  # Get predicted values
  predictions <- predict(id_fit, newdata = data)
  
  # Create confusion matrix
  confusion_matrix <- confusionMatrix(predictions, data$will_go_to_college)
  
  # Print the confusion matrix
  print(confusion_matrix)
  
  # Print accuracy for each fold
  pred <- id_fit$pred
  pred$equal <- ifelse(pred$pred == pred$obs, 1, 0)
  eachfold <- pred %>%
    group_by(Resample) %>%
    summarise_at(vars(equal), list(Accuracy = mean))
  print(eachfold)

  # Plot decision tree
  plot(id_fit$finalModel, main = paste("Decision Tree (Fold", fold, ")"))
  
}
```

Additional insights reveal that attributes such as school accreditation, parent was in collage contribute to high impurity. In contrast, Parent salary is chosen as the root due to its high purity. Given the balanced class labels in our dataset and minimal variations in accuracy across folds, the result yields consistent tree structures, with only two distinct trees observed for all folds. For further details, please refer to the [index]. The overall accuracy consistently surpasses 86%, affirming the model's efficacy

### final analysis

The C4.5 model emerged as the top-performing evaluation model, achieving an impressive accuracy rate of 94% to 97%. It was followed by the ID3 model, which demonstrated slightly lower accuracy ranging from 86% to 89%. Lastly, the cart model exhibited an accuracy rate of 86%.

the C4.5 gave better result than ID3 and Cart because they both are biased to multivalued where C4.5 normalized parent salary and house area which are multivalue attributes

C4.5 and ID3 models, the parent's salary served as the root feature, indicating that the financial circumstances of the student are a crucial factor for contemporary universities.

## index

### all trees of C4.5

```{r}
library(caret)
library(rpart)
library(dplyr)
library(rpart.plot)

folds <- c(2, 3, 4)  # Values of xval (number of folds) to try

for(folds in folds){
for (fold_val in 1:folds) {
  cat("fold =", fold_val, "\n")
  
  trctrl <- trainControl(method = "cv", number = fold_val, savePredictions = TRUE)
  c45_fit <- train(factor(will_go_to_college) ~ ., data = data, method = "J48", trControl = trctrl)
  
   plot(c45_fit$finalModel, main = paste("Decision Tree - Fold", fold_val))
}
  # Get predicted values
  predictions <- predict(c45_fit, newdata = data)
  
  # Create confusion matrix
  confusion_matrix <- confusionMatrix(predictions, data$will_go_to_college)
  
  # Print the confusion matrix
  print(confusion_matrix)
  
  # Print accuracy for each fold
  pred <- c45_fit$pred
  pred$equal <- ifelse(pred$pred == pred$obs, 1, 0)
  eachfold <- pred %>%
    group_by(Resample) %>%
    summarise_at(vars(equal),
                 list(Accuracy = mean))
  print(eachfold)
  
  # Plot decision tree for each fold
 

}
```

## Clustering **Analysis:**

In this analysis, we apply K-means clustering to the dataset using different values of K. K-means clustering is an unsupervised learning algorithm that partitions the data into K clusters based on similarity. We will explore three different values of K and evaluate the clustering results using various metrics.

### Removing the class label and preparing the dataset for Clustering

```{r}
 
original_data <- Preprocessed_dataset

# Remove any non-numeric attributes
numeric_data <- original_data[, sapply(original_data, is.numeric)]

# Remove the class label 'will_go_to_college'
numeric_data <- numeric_data[, !(names(numeric_data) == 'will_go_to_college')]


# Print the dataset "numeric_data" to make sure it's prepared for clustering
print(numeric_data)

# Scaling the dataset
# numeric_data <- scale(numeric_data)

 
```

Now, the 'numeric_dataset' dataset contains only numeric attributes without the class label, which makes it ready for the clustering process.

## K=2

```{r}
# k-means clustering set a seed for random number generation to make the results reproducible 
set.seed(8953)

# run kmeans clustering to find 2 clusters
kmeans.result <- kmeans(numeric_data, 2)

# visualize clustering
library(factoextra)
fviz_cluster(kmeans.result, data = numeric_data)

# print the clustering result
print(kmeans.result)


```

#### The Silhouette coefficient

```{r}
#average for each cluster 
avg_sil <- silhouette(kmeans.result$cluster, dist(numeric_data)) 

#k-means clustering with estimating k and initializations 
fviz_silhouette(avg_sil)

```

#### The total within-cluster sum of squares

```{r}
# Calculate total within-cluster sum of squares
total_withinss <- kmeans.result$tot.withinss
cat("Total Within-Cluster Sum of Squares:", sum(total_withinss), "\n")

true_labels <- c(1, 1, 2, 1, 2, 2, 3, 3, 4, 4)  # Adjust based on your actual true labels

cluster_assignments <- kmeans.result$cluster
 
```

#### BCubed recall and precision

```{r}

# Calculate BCubed precision
precision <- 0
for (i in unique(true_labels)) {
  cluster_indices <- which(true_labels == i)
  precision <- precision + sum((table(cluster_assignments[cluster_indices]) * (table(cluster_assignments[cluster_indices]) - 1)) / sum(table(cluster_assignments[cluster_indices])))
}
precision <- precision / sum(table(cluster_assignments))

# Calculate BCubed recall
recall <- 0
for (j in unique(cluster_assignments)) {
  cluster_indices <- which(cluster_assignments == j)
  recall <- recall + sum((table(true_labels[cluster_indices]) * (table(true_labels[cluster_indices]) - 1)) / sum(table(true_labels[cluster_indices])))
}
recall <- recall / sum(table(true_labels))

cat("BCubed Precision:", precision, "\n")
cat("BCubed Recall:", recall, "\n")


```

## K=4

```{r}
# k-means clustering set a seed for random number generation to make the results reproducible 
set.seed(8953)

# run kmeans clustering to find 4 clusters
kmeans.result <- kmeans(numeric_data, 4)

# visualize clustering
library(factoextra)
fviz_cluster(kmeans.result, data = numeric_data)

# print the clustering result
print(kmeans.result)


```

#### The Silhouette coefficient

```{r}
#average for each cluster 
avg_sil <- silhouette(kmeans.result$cluster, dist(numeric_data)) 

#k-means clustering with estimating k and initializations 
fviz_silhouette(avg_sil)

```

#### The total within-cluster sum of squares

```{r}
# Calculate total within-cluster sum of squares
total_withinss <- kmeans.result$tot.withinss
cat("Total Within-Cluster Sum of Squares:", sum(total_withinss), "\n")

true_labels <- c(1, 1, 2, 1, 2, 2, 3, 3, 4, 4)  # Adjust based on your actual true labels

cluster_assignments <- kmeans.result$cluster
 
```

#### BCubed recall and precision

```{r}

# Calculate BCubed precision
precision <- 0
for (i in unique(true_labels)) {
  cluster_indices <- which(true_labels == i)
  precision <- precision + sum((table(cluster_assignments[cluster_indices]) * (table(cluster_assignments[cluster_indices]) - 1)) / sum(table(cluster_assignments[cluster_indices])))
}
precision <- precision / sum(table(cluster_assignments))

# Calculate BCubed recall
recall <- 0
for (j in unique(cluster_assignments)) {
  cluster_indices <- which(cluster_assignments == j)
  recall <- recall + sum((table(true_labels[cluster_indices]) * (table(true_labels[cluster_indices]) - 1)) / sum(table(true_labels[cluster_indices])))
}
recall <- recall / sum(table(true_labels))

cat("BCubed Precision:", precision, "\n")
cat("BCubed Recall:", recall, "\n")


```

## K=6

```{r}
# k-means clustering set a seed for random number generation to make the results reproducible 
set.seed(8953)

# run kmeans clustering to find 6 clusters
kmeans.result <- kmeans(numeric_data, 6)

# visualize clustering
library(factoextra)
fviz_cluster(kmeans.result, data = numeric_data)

# print the clustering result
print(kmeans.result)


```

#### The Silhouette coefficient

```{r}
#average for each cluster 
avg_sil <- silhouette(kmeans.result$cluster, dist(numeric_data)) 

#k-means clustering with estimating k and initializations 
fviz_silhouette(avg_sil)

```

#### The total within-cluster sum of squares

```{r}
# Calculate total within-cluster sum of squares
total_withinss <- kmeans.result$tot.withinss
cat("Total Within-Cluster Sum of Squares:", sum(total_withinss), "\n")

true_labels <- c(1, 1, 2, 1, 2, 2, 3, 3, 4, 4)  # Adjust based on your actual true labels

cluster_assignments <- kmeans.result$cluster
 
```

#### BCubed recall and precision

```{r}

# Calculate BCubed precision
precision <- 0
for (i in unique(true_labels)) {
  cluster_indices <- which(true_labels == i)
  precision <- precision + sum((table(cluster_assignments[cluster_indices]) * (table(cluster_assignments[cluster_indices]) - 1)) / sum(table(cluster_assignments[cluster_indices])))
}
precision <- precision / sum(table(cluster_assignments))

# Calculate BCubed recall
recall <- 0
for (j in unique(cluster_assignments)) {
  cluster_indices <- which(cluster_assignments == j)
  recall <- recall + sum((table(true_labels[cluster_indices]) * (table(true_labels[cluster_indices]) - 1)) / sum(table(true_labels[cluster_indices])))
}
recall <- recall / sum(table(true_labels))

cat("BCubed Precision:", precision, "\n")
cat("BCubed Recall:", recall, "\n")


```

### The optimal number of clusters

To find the optimal number of clusters to use in the k-means algorithm, we'll use the **fviz_nbclust()** function from the **factoextra**package to create a plot of the number of clusters vs. the total within sum of squares

```{r}
# Function to calculate total within-cluster sum of squares (wss)
wss <- function(k) {
  kmeans_result <- kmeans(numeric_data, centers = k, nstart = 10)  # You can adjust nstart based on your preference
  return(sum(kmeans_result$tot.withinss))
}

# Calculate the total within-cluster sum of squares for different values of k
k_values <- 1:10  # You can adjust the range of k values
wss_values <- sapply(k_values, wss)

# Plot the elbow curve
plot(k_values, wss_values, type = "b", pch = 19, frame = FALSE, 
     xlab = "Number of Clusters (k)", ylab = "Total Within-Cluster Sum of Squares (WSS)",
     main = "Elbow Method")

# Adding a line to indicate the "elbow"
abline(v = which(diff(wss_values) == max(diff(wss_values))) + 1, col = "red")


```

According to the output the best number of clusters is one,

